A Multimodal Fusion Model Leveraging MLP Mixer and Handcrafted Features-based Deep Learning Networks for Facial Palsy Detection
Oo, Heng Yim Nicole, Lee, Min Hun, Lim, Jeong Hoon
Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessments by clinicians. In this paper, we present a multimodal fusion-based deep learning model that utilizes an MLP Mixer-based model to process unstructured data (i.e., RGB images or images with facial line segments) and a feed-forward neural network to process structured data (i.e., facial landmark coordinates, features of facial expressions, or handcrafted features) for detecting facial palsy. We then contribute a study analyzing the effect of different data modalities and the benefits of a multimodal fusion-based approach, using videos of 20 facial palsy patients and 20 healthy subjects. Our multimodal fusion model achieved a 96.00 F1 score, significantly higher than a feed-forward neural network trained on handcrafted features alone (82.80 F1) and an MLP Mixer-based model trained on raw RGB images (89.00 F1).
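As a rough illustration of the late-fusion design the abstract describes, the sketch below concatenates an image-branch embedding (a stand-in for the MLP Mixer output) with a structured-feature embedding before a shared classification head. All dimensions and weights here are hypothetical placeholders, written in plain NumPy rather than a deep-learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """Two-layer feed-forward block (ReLU activation for brevity)."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Hypothetical dimensions: a 64-d image embedding (stand-in for the MLP Mixer
# branch) and a 16-d structured-feature vector (landmarks / handcrafted features).
img_emb = rng.standard_normal(64)      # unstructured (image) branch output
feat_vec = rng.standard_normal(16)     # structured branch input

# Structured branch: small feed-forward network mapping features to a 64-d embedding.
w1, b1 = rng.standard_normal((16, 32)), np.zeros(32)
w2, b2 = rng.standard_normal((32, 64)), np.zeros(64)
feat_emb = mlp(feat_vec, w1, b1, w2, b2)

# Fusion: concatenate both embeddings, then apply a binary classification head.
fused = np.concatenate([img_emb, feat_emb])            # shape (128,)
w_cls, b_cls = rng.standard_normal((128, 2)), np.zeros(2)
logits = fused @ w_cls + b_cls
prob_palsy = np.exp(logits)[1] / np.exp(logits).sum()  # softmax probability
```

The key design point is that fusion happens at the embedding level, so each branch can use the architecture best suited to its modality.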
Exploring a Multimodal Fusion-based Deep Learning Network for Detecting Facial Palsy
Oo, Nicole Heng Yim, Lee, Min Hun, Lim, Jeong Hoon
Facial palsy has serious consequences for patients, such as diminished feeding function, psychological distress, and social withdrawal [10]. To diagnose facial palsy, clinicians usually conduct observation-based physical examinations [7]. However, it is challenging to quantify symptom intensity and variation, to measure changes in these symptoms between visits for a single patient, and to compare symptoms across different patients [8]. To address this challenge, researchers have explored various algorithmic approaches to detecting facial palsy [8, 9, 15, 17]. These approaches broadly fall into two categories: 1) those employing machine learning models with manually extracted features and 2) those leveraging deep learning-based models. Among manual-feature approaches, Ngo et al. [15] proposed a frequency-based technique using limited-orientation modified circular Gabor filters (LO-MCGFs) to magnify desired frequencies in dataset images and extract features from rotation-invariant texture regions for classifying facial palsy. In addition, researchers have explored training data-driven models [6, 9, 17] to detect facial key points and compute features such as the displacement ratio between the left and right halves of the face or the motion information of facial regions. Alternatively, researchers have discussed the limitations of manual features and leveraged RGB images or images with facial line segments to train deep learning-based models (e.g.
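A displacement-ratio feature of the kind mentioned above can be sketched as follows. The landmark coordinates and midline are toy values, and this particular ratio definition is an illustrative assumption rather than the exact formula from the cited works:

```python
import numpy as np

def asymmetry_ratio(landmarks_left, landmarks_right, midline_x):
    """Ratio of mean horizontal displacement of left vs. right facial landmarks
    from the facial midline; values far from 1 suggest facial asymmetry."""
    d_left = np.abs(landmarks_left[:, 0] - midline_x).mean()
    d_right = np.abs(landmarks_right[:, 0] - midline_x).mean()
    return d_left / d_right

# Toy landmark coordinates (x, y) for each half of the face; midline at x = 0.
left = np.array([[-30.0, 10.0], [-28.0, 40.0], [-25.0, 70.0]])
right_sym = np.array([[30.0, 10.0], [28.0, 40.0], [25.0, 70.0]])
right_drooped = np.array([[18.0, 14.0], [17.0, 45.0], [15.0, 76.0]])

ratio_healthy = asymmetry_ratio(left, right_sym, 0.0)      # ≈ 1.0 for a symmetric face
ratio_palsy = asymmetry_ratio(left, right_drooped, 0.0)    # > 1.0 when one side droops inward
```

Scalar features like this are what the structured (feed-forward) branch of a fusion model would consume.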
3DPalsyNet: A Facial Palsy Grading and Motion Recognition Framework using Fully 3D Convolutional Neural Networks
Storey, Gary, Jiang, Richard, Keogh, Shelagh, Bouridane, Ahmed, Li, Chang-Tsun
The capability to perform facial analysis from video sequences has significant potential to positively impact many areas of life. One such area is the medical domain, specifically aiding the diagnosis and rehabilitation of patients with facial palsy. With this application in mind, this paper presents an end-to-end framework, named 3DPalsyNet, for the tasks of mouth motion recognition and facial palsy grading. 3DPalsyNet utilizes a 3D CNN architecture with a ResNet backbone for the prediction of these dynamic tasks. Leveraging transfer learning from a 3D CNN pre-trained on the Kinetics dataset for general action recognition, the model is modified to apply joint supervised learning using center and softmax loss concepts. 3DPalsyNet is evaluated on a test set consisting of individuals with varying degrees of facial palsy and mouth motions, achieving classification accuracies of 82% and 86% on these respective tasks. The effects of frame duration and loss function on the predictive quality of 3DPalsyNet were studied; a shorter frame duration of 8 performed best for this specific task. Center loss combined with softmax showed improved spatio-temporal feature learning over softmax loss alone, in agreement with earlier work in the spatial domain.
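The joint center-plus-softmax objective described in the abstract can be sketched as a weighted sum of the two terms. The embedding size, class count, weight `lam`, and NumPy formulation below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Standard softmax cross-entropy for a single example."""
    z = logits - logits.max()                 # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def center_loss(embedding, centers, label):
    """Squared distance between an embedding and its class center;
    pulls same-class embeddings together while softmax separates classes."""
    return 0.5 * np.sum((embedding - centers[label]) ** 2)

# Toy setup: 4-d embeddings, 3 mouth-motion classes, lam weighting the center term.
rng = np.random.default_rng(1)
embedding = rng.standard_normal(4)
centers = rng.standard_normal((3, 4))         # one learnable center per class
w_cls = rng.standard_normal((4, 3))           # classification head
label, lam = 2, 0.01

logits = embedding @ w_cls
joint = softmax_cross_entropy(logits, label) + lam * center_loss(embedding, centers, label)
```

In training, both the class centers and the network weights would be updated by gradient descent; `lam` balances intra-class compactness against inter-class separability.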